fpp3 9.11, Ex9

Consider aus_arrivals, the quarterly number of international visitors to Australia from several countries for the period 1981 Q1 – 2012 Q3.

  a. Select one country and describe the time plot.

aus_arrivals |>
  filter(Origin == "Japan") |>
  autoplot(Arrivals)

  • There is an increasing trend to about 1996, and slowly decreasing thereafter.
  • The seasonal shape has changed considerably over time.
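The changing seasonal shape is easier to see in a season plot, which overlays one line per year. A quick sketch using `gg_season()` from the feasts package:

```r
library(fpp3)

# One line per year; changes in the seasonal shape show up as
# lines with different profiles in different decades.
aus_arrivals |>
  filter(Origin == "Japan") |>
  gg_season(Arrivals)
```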
  b. Use differencing to obtain stationary data.
aus_arrivals |>
  filter(Origin == "Japan") |>
  gg_tsdisplay(Arrivals |> difference(lag=4) |> difference(), plot_type = "partial")

  c. What can you learn from the ACF graph of the differenced data?
  • The non-seasonal lags suggest an MA(1) component.
  • The seasonal lags suggest a seasonal MA(1) component.
  d. What can you learn from the PACF graph of the differenced data?
  • The non-seasonal lags might suggest an AR(1) component. But the lag 2 spike in the PACF is larger than the lag 2 spike in the ACF. So an MA(1) is probably better.
  • The seasonal lags show geometric decay which is consistent with a seasonal MA(1).
  e. What model do these graphs suggest?
  • The suggested model is an ARIMA(0,1,1)(0,1,1)[4].
  f. Does ARIMA() give the same model that you chose? If not, which model do you think is better?
aus_arrivals |>
  filter(Origin == "Japan") |>
  model(ARIMA(Arrivals)) |>
    report()
## Series: Arrivals 
## Model: ARIMA(0,1,1)(1,1,1)[4] 
## 
## Coefficients:
##           ma1     sar1     sma1
##       -0.4228  -0.2305  -0.5267
## s.e.   0.0944   0.1337   0.1246
## 
## sigma^2 estimated as 174801727:  log likelihood=-1330.66
## AIC=2669.32   AICc=2669.66   BIC=2680.54

The resulting model has an additional seasonal AR(1) component compared to what I guessed. We can compare the two models based on the AICc statistic:

aus_arrivals |>
  filter(Origin == "Japan") |>
  model(
    guess = ARIMA(Arrivals ~ pdq(0,1,1) + PDQ(0,1,1)),
    auto = ARIMA(Arrivals)
  ) |>
  glance()
## # A tibble: 2 × 9
##   Origin .model     sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots 
##   <chr>  <chr>       <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>   
## 1 Japan  guess  177223035.  -1332. 2670. 2670. 2678. <cpl [0]> <cpl [5]>
## 2 Japan  auto   174801727.  -1331. 2669. 2670. 2681. <cpl [4]> <cpl [5]>

The automatic model is only slightly better than my guess based on the AICc statistic.

  g. Write the model in terms of the backshift operator, then without using the backshift operator.

\[ (1-B)(1-B^4)(1-\Phi B^4)y_t = (1+\theta B)(1+\Theta B^4) \varepsilon_t \]

\[ \left[1-B - (1 + \Phi)B^4 + (1 + \Phi) B^5 + \Phi B^8 - \Phi B^9\right]y_t = (1+\theta B + \Theta B^4 + \theta\Theta B^5) \varepsilon_t \]

\[ y_t - y_{t-1} - (1 + \Phi)y_{t-4} + (1 + \Phi) y_{t-5} + \Phi y_{t-8} - \Phi y_{t-9} = \varepsilon_t + \theta \varepsilon_{t-1} + \Theta \varepsilon_{t-4} + \theta\Theta \varepsilon_{t-5}. \]

\[ y_t = y_{t-1} + (1 + \Phi)y_{t-4} - (1 + \Phi) y_{t-5} - \Phi y_{t-8} + \Phi y_{t-9} + \varepsilon_t + \theta \varepsilon_{t-1} + \Theta \varepsilon_{t-4} + \theta\Theta \varepsilon_{t-5}. \]
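As a sanity check, the left-hand polynomial expansion can be verified numerically in base R by multiplying the three factors via convolution; the value of \(\Phi\) below is the estimated seasonal AR(1) coefficient from the fitted model.

```r
Phi <- -0.2305            # estimated sar1 coefficient

# Coefficient vectors in increasing powers of B
p1 <- c(1, -1)            # (1 - B)
p2 <- c(1, 0, 0, 0, -1)   # (1 - B^4)
p3 <- c(1, 0, 0, 0, -Phi) # (1 - Phi B^4)

# Polynomial multiplication via open convolution
prod12 <- convolve(p1, rev(p2), type = "open")
full   <- convolve(prod12, rev(p3), type = "open")
round(full, 4)
# Coefficients of B^0..B^9 should be:
# 1, -1, 0, 0, -(1+Phi), (1+Phi), 0, 0, Phi, -Phi
```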

fpp3 9.11, Ex10

Choose a series from us_employment, the total employment in different industries in the United States.

  a. Produce an STL decomposition of the data and describe the trend and seasonality.
leisure <- us_employment |>
  filter(Title == "Leisure and Hospitality")
leisure |>
  autoplot(Employed)

  • The sudden change in the seasonal pattern is probably due to some change in the definition of who is counted in this group. So our STL decomposition will need to have a small seasonal window to handle that.
  • In addition, the variation changes a little as the level increases, so we will also use a square root transformation.
leisure |>
  model(STL(sqrt(Employed) ~ season(window=7))) |>
  components() |>
  autoplot()

  • With such a long series, it is not surprising to see the seasonality change a lot over time.
  • The seasonal pattern changed in the 1990s to what it is now.
  • The period of change was rapid, and the seasonal component hasn’t fully captured the change, leading to some seasonality ending up in the remainder series.
  • The trend is increasing.
  b. Do the data need transforming? If so, find a suitable transformation.

Yes. A square root transformation worked reasonably well; the resulting remainder series is relatively homoscedastic. With no transformation, or with a log transformation, the remainder series appeared heteroscedastic.
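One way to check this claim is to fit STL decompositions under each candidate transformation and compare the remainder components side by side. A sketch (the model names are just labels):

```r
library(fpp3)

# STL under three transformations; compare the remainder series
leisure |>
  model(
    none = STL(Employed ~ season(window = 7)),
    root = STL(sqrt(Employed) ~ season(window = 7)),
    log  = STL(log(Employed) ~ season(window = 7))
  ) |>
  components() |>
  ggplot(aes(x = Month, y = remainder)) +
  geom_line() +
  facet_grid(vars(.model), scales = "free_y")
```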

leisure |> features(Employed, guerrero)
## # A tibble: 1 × 2
##   Series_ID     lambda_guerrero
##   <chr>                   <dbl>
## 1 CEU7000000001          -0.216
  • The automatically selected transformation is close to logs.
  • My preference is for something a little larger.
  • I think the automatic procedure is confusing the changing seasonality with the increasing variance.
  c. Are the data stationary? If not, find an appropriate differencing which yields stationary data.
leisure |>
  autoplot(sqrt(Employed) |> difference(lag=12) |> difference())

The double-differenced square-root-transformed data is close to stationary, although the variance has decreased over time.

  d. Identify a couple of ARIMA models that might be useful in describing the time series. Which of your models is the best according to their AICc values?
leisure |>
  gg_tsdisplay(sqrt(Employed) |> difference(lag=12) |> difference(), plot_type="partial")

  • This suggests that an ARIMA(2,1,0)(0,1,1)[12] would be a good start.
  • An alternative would be an ARIMA(0,1,2)(0,1,1)[12].
fit <- leisure |>
  model(
    arima210011 = ARIMA(sqrt(Employed) ~ pdq(2,1,0) + PDQ(0,1,1)),
    arima012011 = ARIMA(sqrt(Employed) ~ pdq(0,1,2) + PDQ(0,1,1))
  )
glance(fit)
## # A tibble: 2 × 9
##   Series_ID     .model      sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots
##   <chr>         <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>  
## 1 CEU7000000001 arima210011 0.0380    207. -406. -406. -386. <cpl [2]> <cpl>   
## 2 CEU7000000001 arima012011 0.0381    206. -404. -404. -384. <cpl [0]> <cpl>

The ARIMA(2,1,0)(0,1,1)[12] model is better, with the lower AICc.

  e. Estimate the parameters of your best model and do diagnostic testing on the residuals. Do the residuals resemble white noise? If not, try to find another ARIMA model which fits better.
fit |>
  select(arima210011) |>
  gg_tsresiduals()

The tails of the residual distribution are too long, and there is significant autocorrelation at lag 11, as well as some smaller significant spikes elsewhere.

fit <- leisure |>
  model(
    arima210011 = ARIMA(sqrt(Employed) ~ pdq(2,1,0) + PDQ(0,1,1)),
    arima012011 = ARIMA(sqrt(Employed) ~ pdq(0,1,2) + PDQ(0,1,1)),
    auto = ARIMA(sqrt(Employed))
  )
glance(fit)
## # A tibble: 3 × 9
##   Series_ID     .model      sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots
##   <chr>         <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>  
## 1 CEU7000000001 arima210011 0.0380    207. -406. -406. -386. <cpl [2]> <cpl>   
## 2 CEU7000000001 arima012011 0.0381    206. -404. -404. -384. <cpl [0]> <cpl>   
## 3 CEU7000000001 auto        0.0365    226. -440. -440. -411. <cpl [2]> <cpl>
fit |> select(auto) |> report()
## Series: Employed 
## Model: ARIMA(2,1,2)(0,1,1)[12] 
## Transformation: sqrt(Employed) 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2     sma1
##       1.6261  -0.9132  -1.4773  0.7937  -0.5443
## s.e.  0.0400   0.0309   0.0535  0.0352   0.0340
## 
## sigma^2 estimated as 0.03655:  log likelihood=226.22
## AIC=-440.44   AICc=-440.35   BIC=-411.26

The automatically selected ARIMA(2,1,2)(0,1,1)[12] model is better than either of my selections.

fit |>
  select(auto) |>
  gg_tsresiduals()

The residuals look better, although there is still a significant spike at lag 11.

  f. Forecast the next 3 years of data. Get the latest figures from https://fred.stlouisfed.org/categories/11 to check the accuracy of your forecasts.
fc <- fit |>
  forecast(h = "3 years")
fc |>
  filter(.model=="auto") |>
  autoplot(us_employment |> filter(year(Month) > 2000))

update <- readr::read_csv("data/CEU7000000001.csv") |>
  mutate(
    Month = yearmonth(DATE),
    Employed = CEU7000000001
  ) |>
  select(Month, Employed) |>
  as_tsibble(index=Month) |>
  filter(Month >= min(fc$Month))
fc |> accuracy(update)
## # A tibble: 3 × 10
##   .model      .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>       <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima012011 Test  -2887. 3426. 2887. -22.3  22.3   NaN   NaN 0.706
## 2 arima210011 Test  -2885. 3425. 2885. -22.3  22.3   NaN   NaN 0.706
## 3 auto        Test  -2931. 3463. 2931. -22.7  22.7   NaN   NaN 0.705
fc |>
  filter(.model=="auto") |>
  autoplot(us_employment |> filter(year(Month) > 2000)) +
  geom_line(data=update, aes(x=Month, y=Employed), col='red')

The initial forecasts look good, but then the pandemic had a huge impact on employment in this industry.

  g. Eventually, the prediction intervals are so wide that the forecasts are not particularly useful. How many years of forecasts do you think are sufficiently accurate to be usable?

Given the pandemic, only about 5 months of the forecasts turned out to be usable. Under more normal conditions, perhaps 2–3 years.
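The widening of the intervals can be quantified directly by extracting the 95% bounds with hilo(). A sketch, reusing the fc object from the previous part:

```r
# Width of the 95% prediction interval at each forecast horizon
fc |>
  filter(.model == "auto") |>
  hilo(95) |>
  as_tibble() |>
  mutate(width = `95%`$upper - `95%`$lower) |>
  select(Month, width)
```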

fpp3 9.11, Ex11

Choose one of the following seasonal time series: the Australian production of electricity, cement, or gas (from aus_production).

  a. Do the data need transforming? If so, find a suitable transformation.
aus_production |>
  autoplot(Electricity)

Yes, the variation increases with the level of the series, so the data need transforming.

lambda <- aus_production |>
  features(Electricity, guerrero) |>
  pull(lambda_guerrero)
aus_production |>
  autoplot(box_cox(Electricity, lambda))

guerrero() suggests a Box-Cox transformation with parameter \(\lambda=0.52\).

  b. Are the data stationary? If not, find an appropriate differencing which yields stationary data.

The trend and seasonality show that the data are not stationary.

aus_production |>
  gg_tsdisplay(box_cox(Electricity, lambda) |> difference(4), plot_type = "partial")

aus_production |>
  gg_tsdisplay(box_cox(Electricity, lambda) |> difference(4) |> difference(1), plot_type = "partial")

  • Seasonal differencing alone appears to be almost sufficient; you could try continuing with only the seasonal difference.
  • We opt to take a first-order difference as well.
  c. Identify a couple of ARIMA models that might be useful in describing the time series. Which of your models is the best according to their AIC values?

From the graphs above, an AR(1) or an MA(1) together with a seasonal MA(2) might work, so we try an ARIMA(1,1,0)(0,1,2)[4] model for the transformed data.

fit <- aus_production |>
  model(
    manual = ARIMA(box_cox(Electricity, lambda) ~ 0 + pdq(1, 1, 0) + PDQ(0, 1, 2)),
    auto = ARIMA(box_cox(Electricity, lambda))
  )
fit |>
  select(auto) |>
  report()
## Series: Electricity 
## Model: ARIMA(1,1,4)(0,1,1)[4] 
## Transformation: box_cox(Electricity, lambda) 
## 
## Coefficients:
##           ar1     ma1      ma2      ma3      ma4     sma1
##       -0.7030  0.2430  -0.4477  -0.1553  -0.2452  -0.5574
## s.e.   0.1739  0.1943   0.0973   0.0931   0.1189   0.1087
## 
## sigma^2 estimated as 18.02:  log likelihood=-609.08
## AIC=1232.17   AICc=1232.71   BIC=1255.7
glance(fit)
## # A tibble: 2 × 8
##   .model sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots 
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>   
## 1 manual   19.7   -620. 1247. 1248. 1261. <cpl [1]> <cpl [8]>
## 2 auto     18.0   -609. 1232. 1233. 1256. <cpl [1]> <cpl [8]>

Automatic model selection with ARIMA() has also taken a first-order difference, so the AICc values are directly comparable. This is a challenging ARIMA model to identify manually, and the automatic model is clearly better.

  d. Estimate the parameters of your best model and do diagnostic testing on the residuals. Do the residuals resemble white noise? If not, try to find another ARIMA model which fits better.
fit |>
  select(auto) |>
  gg_tsresiduals()

# Report the model again to find the dof for the Ljung-Box test
# (dof = the number of estimated coefficients)
fit |> select(auto) |> report()
## Series: Electricity 
## Model: ARIMA(1,1,4)(0,1,1)[4] 
## Transformation: box_cox(Electricity, lambda) 
## 
## Coefficients:
##           ar1     ma1      ma2      ma3      ma4     sma1
##       -0.7030  0.2430  -0.4477  -0.1553  -0.2452  -0.5574
## s.e.   0.1739  0.1943   0.0973   0.0931   0.1189   0.1087
## 
## sigma^2 estimated as 18.02:  log likelihood=-609.08
## AIC=1232.17   AICc=1232.71   BIC=1255.7
tidy(fit)
## # A tibble: 9 × 6
##   .model term  estimate std.error statistic  p.value
##   <chr>  <chr>    <dbl>     <dbl>     <dbl>    <dbl>
## 1 manual ar1    -0.346     0.0648    -5.34  2.33e- 7
## 2 manual sma1   -0.731     0.0876    -8.34  9.16e-15
## 3 manual sma2    0.0820    0.0858     0.956 3.40e- 1
## 4 auto   ar1    -0.703     0.174     -4.04  7.40e- 5
## 5 auto   ma1     0.243     0.194      1.25  2.13e- 1
## 6 auto   ma2    -0.448     0.0973    -4.60  7.15e- 6
## 7 auto   ma3    -0.155     0.0931    -1.67  9.68e- 2
## 8 auto   ma4    -0.245     0.119     -2.06  4.04e- 2
## 9 auto   sma1   -0.557     0.109     -5.13  6.52e- 7
fit |>
  select(auto) |>
  augment() |>
  features(.innov, ljung_box, dof = 6, lag = 12)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 auto      8.45     0.207

The residuals look reasonable; they resemble white noise.

  e. Forecast the next 24 months of data using your preferred model.
fit |>
  select(auto) |>
  forecast(h = "2 years") |>
  autoplot(aus_production)

  f. Compare the forecasts obtained using ETS().
aus_production |>
  model(ETS(Electricity)) |>
  forecast(h = "2 years") |>
  autoplot(aus_production)

aus_production |>
  model(ETS(Electricity)) |>
  forecast(h = "2 years") |>
  autoplot(aus_production |> filter(year(Quarter) >= 2000)) +
  autolayer(fit |> select(auto) |> forecast(h = "2 years"), colour = "red", alpha = 0.4)
## Scale for fill_ramp is already present.
## Adding another scale for fill_ramp, which will replace the existing scale.

  • The point forecasts appear to be quite similar.
  • The ETS prediction intervals are wider than the ARIMA intervals.

fpp3 9.11, Ex12

For the same time series you used in the previous exercise, try using a non-seasonal model applied to the seasonally adjusted data obtained from STL. Compare the forecasts with those obtained in the previous exercise. Which do you think is the best approach?

lambda <- aus_production |>
  features(Electricity, guerrero) |>
  pull(lambda_guerrero)
stlarima <- decomposition_model(
  STL(box_cox(Electricity, lambda)),
  ARIMA(season_adjust)
)
fc <- aus_production |>
  model(
    ets = ETS(Electricity),
    arima = ARIMA(box_cox(Electricity, lambda)),
    stlarima = stlarima
  ) |>
  forecast(h = "2 years")
fc |> autoplot(aus_production |> filter(year(Quarter) > 2000), level=95)

The STL-ARIMA approach has higher values and narrower prediction intervals. It is hard to know which is best without comparing against a test set.
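A rough test-set comparison is sketched below, holding out the last two years of observations (the cutoff year is an arbitrary choice, and the transformation parameter is re-estimated on the training data only):

```r
library(fpp3)

# Hold out the final two years as a test set
train <- aus_production |> filter(year(Quarter) <= 2008)
lambda_tr <- train |>
  features(Electricity, guerrero) |>
  pull(lambda_guerrero)

# Refit all three approaches on the training data and
# compare test-set accuracy
train |>
  model(
    ets = ETS(Electricity),
    arima = ARIMA(box_cox(Electricity, lambda_tr)),
    stlarima = decomposition_model(
      STL(box_cox(Electricity, lambda_tr)),
      ARIMA(season_adjust)
    )
  ) |>
  forecast(h = "2 years") |>
  accuracy(aus_production)
```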

fpp3 9.11, Ex13

For the Australian tourism data (from tourism):

  a. Fit a suitable ARIMA model for all data.
  b. Produce forecasts of your fitted models.
  c. Check the forecasts for the “Snowy Mountains” and “Melbourne” regions. Do they look reasonable?

fit <- tourism |>
  model(arima = ARIMA(Trips))
fc <- fit |> forecast(h="3 years")
fc |>
  filter(Region == "Snowy Mountains") |>
  autoplot(tourism)

fc |>
  filter(Region == "Melbourne") |>
  autoplot(tourism)

Both sets of forecasts appear to have captured the underlying trend and seasonality effectively.
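As a further check (not part of the question), the residuals for one of these series can be inspected by filtering the mable down to a single key before calling gg_tsresiduals():

```r
# Residual diagnostics for one Snowy Mountains series
fit |>
  filter(Region == "Snowy Mountains", Purpose == "Holiday") |>
  gg_tsresiduals()
```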